Rework on ParquetDataset for easy access and better cache size in eager mode #384
/cc @terrytangyuan @BryanCutler @feihugis /cc @CaptainDuke in case you are interested. I am thinking about applying a similar enhancement to HDF5 as well.
Many thanks to Yongtang. Yes, actually the contents of HDF5 files do not need to be decoded. Also, I'm working on HDF5 files with different sizes. For example.
I believe such an enhancement would be helpful.
@CaptainDuke the issue #342 you are referring to might not be directly related to this problem. However, the recent changes in upstream tf.data (tensorflow/tensorflow@c5c1839) might complicate things, as we will likely need to update the API pretty soon. With the ongoing rework of cache size and of the tf.io pipeline's interaction with tf.data, it might make sense to fix that together with the PR here.
…er mode

This fix is part of the effort to improve overall Dataset for easy access and better cache size in eager mode. See 382 and 366 for related discussions.

In order to read a file either by filename or from memory, this PR adds a SizedRandomAccessFile, which accepts an optional memory buffer as the file content. This could be useful when processing compressed files or archives, where we could just read the uncompressed file content into memory.

The previous limitation in Dataset was that a Dataset is an iterable, so the sequence length is unknown until graph runtime. In this PR, we provide a helper function to read the specs of a parquet file, so the length is known.

This could also open other avenues, such as mapping a parquet file with __getitem__ and __len__.

Further, a parquet file could be read into a Tensor and processed easily (such as with a pandas-like API).

The read_parquet_specs approach could be similarly applied to HDF5, where it is more important: HDF5 files could have datasets with different sizes.

Summary:
1) Two basic C++ kernel ops are implemented: read_parquet_specs and read_parquet
2) One ParquetDataset that is a Python implementation only (no C++ anymore)
3) ParquetDataset supports eager and graph mode. In graph mode, dtype and shape are provided by the user explicitly. In eager mode, only the column name is needed.
4) read_parquet works in eager and graph mode, and can read records either in full or in slices
5) read_parquet_specs works in eager mode only (limitation).

For cache batch vs. batch in tf.keras:
1) Added a hidden `capacity` to adjust the cache batch size
2) The batch passed to tf.keras is unrelated to `capacity`, but we could use `rebatch` to change it at the end of the pipeline.
3) `capacity` could be padded to allow `rebatch` to only cut a slice over one chunk. If not padded to `batch_size` in tf.keras, then `rebatch` will likely copy over a chunk boundary.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
…er mode (tensorflow#384)
* Rework on ParquetDataset for easy access and better cache size in eager mode
* Fix build failures
* Rename read_parquet_columns => list_parquet_columns
* Remove batch args, and add test in graph mode
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This fix is part of the effort to improve overall Dataset for easy access and better cache size in eager mode. See #382 and #366 for related discussions.
In order to read a file either by filename or from memory, this PR adds a SizedRandomAccessFile, which accepts an optional memory buffer as the file content. This could be useful when processing compressed files or archives, where we could just read the uncompressed file content into memory.
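To make the file-or-buffer idea concrete, here is a minimal Python sketch. It is not the C++ class added in this PR: the constructor arguments and the `size`/`read` method names are assumptions chosen for illustration only.

```python
import os
from typing import Optional


class SizedRandomAccessFile:
    """Illustrative stand-in: random access over a file or an in-memory buffer.

    The interface (size/read) is hypothetical and only mirrors the idea
    described above, not the actual C++ implementation.
    """

    def __init__(self, filename: str, buffer: Optional[bytes] = None):
        self._filename = filename
        self._buffer = buffer  # when given, the file on disk is never touched

    def size(self) -> int:
        # The size is always known up front, from the buffer or from the file system.
        if self._buffer is not None:
            return len(self._buffer)
        return os.path.getsize(self._filename)

    def read(self, offset: int, n: int) -> bytes:
        # Return up to n bytes starting at offset.
        if self._buffer is not None:
            return self._buffer[offset:offset + n]
        with open(self._filename, "rb") as f:
            f.seek(offset)
            return f.read(n)
```

A caller that has already decompressed an archive member could pass the bytes as `buffer` and keep `filename` only as a label.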
The previous limitation in Dataset was that a Dataset is an iterable, so the sequence length is unknown until graph runtime. In this PR, we provide a helper function to read the columns of a parquet file, so the length is known.
This could also open other avenues, such as mapping a parquet file with __getitem__ and __len__.
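As a rough sketch of what a map-style view could look like once the length is known up front: the `ColumnView` class and its `read_slice` callback below are hypothetical names, and a plain Python list stands in for a parquet column.

```python
from typing import Callable, Sequence


class ColumnView:
    """Hypothetical map-style view over one column of known length."""

    def __init__(self, length: int, read_slice: Callable[[int, int], Sequence]):
        self._length = length
        self._read_slice = read_slice  # read_slice(start, stop) -> records

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(self._length)
            if step != 1:
                raise ValueError("only contiguous slices are supported in this sketch")
            return list(self._read_slice(start, stop))
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError(index)
        return self._read_slice(index, index + 1)[0]


# In-memory stand-in for a column stored in a file.
data = list(range(100, 110))
view = ColumnView(len(data), lambda start, stop: data[start:stop])
```

With the length available, `len(view)`, `view[3]`, and `view[2:5]` all work without iterating from the start.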
Further, a parquet file could be read into a Tensor and processed easily (such as with a pandas-like API).
The list_parquet_columns approach could be similarly applied to HDF5, where it is more important: HDF5 files could have datasets with different sizes.
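Summary:
1) Two basic C++ kernel ops are implemented: list_parquet_columns and read_parquet
2) One ParquetDataset that is a Python implementation only (no C++ anymore)
3) ParquetDataset supports eager and graph mode. In graph mode, dtype and shape are provided by the user explicitly. In eager mode, only the column name is needed.
4) read_parquet works in eager and graph mode, and can read records either in full or in slices
5) list_parquet_columns works in eager mode only (limitation).
For cache batch vs. batch in tf.keras:
1) Added a hidden `capacity` to adjust the cache batch size
2) The batch passed to tf.keras is unrelated to `capacity`, but we could use `rebatch` to change it at the end of the pipeline.
3) `capacity` could be padded to allow `rebatch` to only cut a slice over one chunk. If not padded to `batch_size` in tf.keras, then `rebatch` will likely copy over a chunk boundary.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>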
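The interaction between `capacity` and `batch_size` can be illustrated with a small pure-Python sketch. The functions below (`read_chunks`, `rebatch`, `count_crossing`) are hypothetical helpers written for this example and are not the PR's implementation; they only show why padding `capacity` to a multiple of `batch_size` avoids batches that straddle two cached chunks.

```python
from typing import Iterable, Iterator, List, Tuple


def read_chunks(total: int, capacity: int) -> Iterator[List[int]]:
    """Yield record indices in cached chunks of at most `capacity` records."""
    for start in range(0, total, capacity):
        yield list(range(start, min(start + capacity, total)))


def rebatch(chunks: Iterable[List[int]], batch_size: int) -> Iterator[Tuple[List[int], bool]]:
    """Yield (batch, crosses_boundary) pairs.

    crosses_boundary is True when a batch had to be assembled from more than
    one chunk, which is the case that would need a copy.
    """
    pending: List[int] = []
    crossed = False
    for chunk in chunks:
        pos = 0
        if pending:
            # Finish the partial batch left over from the previous chunk.
            need = batch_size - len(pending)
            pending = pending + chunk[:need]
            crossed = True
            pos = min(need, len(chunk))
            if len(pending) < batch_size:
                continue
            yield pending, crossed
            pending, crossed = [], False
        # Batches that fit entirely inside this chunk are plain slices.
        while pos + batch_size <= len(chunk):
            yield chunk[pos:pos + batch_size], False
            pos += batch_size
        pending = chunk[pos:]
    if pending:
        yield pending, crossed


def count_crossing(total: int, capacity: int, batch_size: int) -> int:
    """Count batches that straddle a chunk boundary."""
    return sum(1 for _, crossed in rebatch(read_chunks(total, capacity), batch_size) if crossed)
```

For 12 records and `batch_size` 3, a `capacity` of 6 gives only in-chunk slices, while a `capacity` of 4 forces some batches to be stitched together from two chunks.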